Take-home Exercise 3

Explore different perspectives and approaches to create more truthful and enlightening data visualisation

Huan Li https://linkedin.com/in/huan-li-ab7498124/ (SMU, SCIS, Master of IT in Business)https://scis.smu.edu.sg/master-it-business/about-mitb-main
2022-05-15

1. Overview

Anticipating rapid growth, the city of Engagement, Ohio USA is doing a participatory urban planning exercise to understand the current state of the city and identify opportunities for future growth. About 1000 representative residents in this modest-sized city have agreed to provide data using the city’s urban planning app, which records the places they visit, their spending, and their purchases, among other things. From these volunteers, the city will have data to assist with their major community revitalization efforts, including how to allocate a very large city renewal grant they have recently received.

Economic considers the financial health of the city. How does the financial health of the residents change over the period covered by the dataset? How do wages compare to the overall cost of living in Engagement? Are there groups that appear to exhibit similar patterns?

In this exercise, we will explore different perspectives and approaches to create more enlightening data visualisation on dataset VAST Challenge 2022. The operation was carried out on Rstudio and main packages used are tidyverse and ggplot2 extensions.

2. Data Preparation

2.1 Installing and loading the required libraries

Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.

The chunk code on the right will do the trick.

hide
packages = c('tidyverse', 'knitr', 'ggdist', 'ggridges',
             'scales', 'grid', 'gridExtra','plotly',
             'ggrepel', 'formattable', 'patchwork',
             'lubridate', 'data.table')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

2.2 Importing the dataset

The code chunk below imports participants.csv and FinancialJournal.csv into R environment using read_csv() function of readr package.

hide
participants <- read_csv('data/Participants.csv')
financial <- read_csv('data/FinancialJournal.csv')

It is always a good practice to examine the imported data frame before further analysis is performed.

For example, kable() can be used to review the structure of the imported data frame.

Let’s take an overview of the datasets.

hide
kable(head(participants))
participantId householdSize haveKids age educationLevel interestGroup joviality
0 3 TRUE 36 HighSchoolOrCollege H 0.0016267
1 3 TRUE 25 HighSchoolOrCollege B 0.3280865
2 3 TRUE 35 HighSchoolOrCollege A 0.3934696
3 3 TRUE 21 HighSchoolOrCollege I 0.1380634
4 3 TRUE 43 Bachelors H 0.8573967
5 3 TRUE 32 HighSchoolOrCollege D 0.7729578
hide
kable(head(financial))
participantId timestamp category amount
0 2022-03-01 Wage 2472.50756
0 2022-03-01 Shelter -554.98862
0 2022-03-01 Education -38.00538
1 2022-03-01 Wage 2046.56221
1 2022-03-01 Shelter -554.98862
1 2022-03-01 Education -38.00538

2.3 Data Wrangling

In order to understand the financial health of the residents change over the period, we need to derive income, overall cost and balance of residents in a monthly basis.

2.3.1 Dealling with time interval

Monthly income/cost for residents need to be derived to view the change over recorded 15 months. Code chunk below shows how we change time format to monthly basis.

hide
monthlyFinancial <- financial %>% 
  mutate(yearmonth = format(as.Date(timestamp), "%Y.%m")) %>% 
  select(-timestamp)
monthlyFinancial
# A tibble: 1,856,330 x 4
   participantId category  amount yearmonth
           <dbl> <chr>      <dbl> <chr>    
 1             0 Wage      2473.  2022.03  
 2             0 Shelter   -555.  2022.03  
 3             0 Education  -38.0 2022.03  
 4             1 Wage      2047.  2022.03  
 5             1 Shelter   -555.  2022.03  
 6             1 Education  -38.0 2022.03  
 7             2 Wage      2437.  2022.03  
 8             2 Shelter   -557.  2022.03  
 9             2 Education  -12.8 2022.03  
10             3 Wage      2367.  2022.03  
# ... with 1,856,320 more rows

2.3.2 Pivot Dataframe

Firstly, we need to use group_by to group individual expense and income category. And then use summarise function to summarize each category.

hide
summarizedFinancial <- monthlyFinancial %>% 
  group_by(participantId, category, yearmonth) %>% 
  summarise(monthly_financial = sum(amount))
summarizedFinancial
# A tibble: 55,498 x 4
# Groups:   participantId, category [4,142]
   participantId category  yearmonth monthly_financial
           <dbl> <chr>     <chr>                 <dbl>
 1             0 Education 2022.03               -76.0
 2             0 Education 2022.04               -38.0
 3             0 Education 2022.05               -38.0
 4             0 Education 2022.06               -38.0
 5             0 Education 2022.07               -38.0
 6             0 Education 2022.08               -38.0
 7             0 Education 2022.09               -38.0
 8             0 Education 2022.10               -38.0
 9             0 Education 2022.11               -38.0
10             0 Education 2022.12               -38.0
# ... with 55,488 more rows

Then, the dataframe need to be pivoted using code chunk below.

hide
Financial <- summarizedFinancial %>% 
  pivot_wider(names_from = category, values_from = monthly_financial)
Financial[is.na(Financial)] = 0
Financial
# A tibble: 13,331 x 8
# Groups:   participantId [1,011]
   participantId yearmonth Education  Food Recreation Shelter   Wage
           <dbl> <chr>         <dbl> <dbl>      <dbl>   <dbl>  <dbl>
 1             0 2022.03       -76.0 -268.     -349.   -1110. 11932.
 2             0 2022.04       -38.0 -266.     -219.    -555.  8637.
 3             0 2022.05       -38.0 -265.     -383.    -555.  9048.
 4             0 2022.06       -38.0 -257.     -466.    -555.  9048.
 5             0 2022.07       -38.0 -270.    -1070.    -555.  8637.
 6             0 2022.08       -38.0 -262.     -314.    -555.  9459.
 7             0 2022.09       -38.0 -256.     -295.    -555.  9048.
 8             0 2022.10       -38.0 -267.      -25.0   -555.  8637.
 9             0 2022.11       -38.0 -261.     -377.    -555.  9048.
10             0 2022.12       -38.0 -266.     -357.    -555.  9048.
# ... with 13,321 more rows, and 1 more variable:
#   RentAdjustment <dbl>

2.3.3 Deriving monthly-income, cost-of-ling and balance

To show the change of financial situation of residents during this 15 months,we need to calculate the monthly income , monthly living cost as well as monthly balance and then used them to visualize in the next part.

Residents’ monthly income is derived by calculating the sum of wage.

Cost of living is made up of expenses from education, food, recreation, shelter, and offset the rent adjustment. Residents’ monthly cost of living is derived by calculating the sum of above expenses.

hide
FINANCIAL <- Financial %>% 
  mutate(monthly_cost = Education + Food + Recreation
         + Shelter + RentAdjustment) %>% 
  mutate(monthly_income = Wage) %>% 
  mutate(monthly_balance = monthly_income + monthly_cost)
FINANCIAL
# A tibble: 13,331 x 11
# Groups:   participantId [1,011]
   participantId yearmonth Education  Food Recreation Shelter   Wage
           <dbl> <chr>         <dbl> <dbl>      <dbl>   <dbl>  <dbl>
 1             0 2022.03       -76.0 -268.     -349.   -1110. 11932.
 2             0 2022.04       -38.0 -266.     -219.    -555.  8637.
 3             0 2022.05       -38.0 -265.     -383.    -555.  9048.
 4             0 2022.06       -38.0 -257.     -466.    -555.  9048.
 5             0 2022.07       -38.0 -270.    -1070.    -555.  8637.
 6             0 2022.08       -38.0 -262.     -314.    -555.  9459.
 7             0 2022.09       -38.0 -256.     -295.    -555.  9048.
 8             0 2022.10       -38.0 -267.      -25.0   -555.  8637.
 9             0 2022.11       -38.0 -261.     -377.    -555.  9048.
10             0 2022.12       -38.0 -266.     -357.    -555.  9048.
# ... with 13,321 more rows, and 4 more variables:
#   RentAdjustment <dbl>, monthly_cost <dbl>, monthly_income <dbl>,
#   monthly_balance <dbl>

2.3.4 Join tables

In order to show the income and consumption patterns in different groups, we will combine FINANCIAL and participants dataframe together.

hide
combine <- FINANCIAL %>% 
  left_join(participants, by = "participantId")
combine
# A tibble: 13,331 x 17
# Groups:   participantId [1,011]
   participantId yearmonth Education  Food Recreation Shelter   Wage
           <dbl> <chr>         <dbl> <dbl>      <dbl>   <dbl>  <dbl>
 1             0 2022.03       -76.0 -268.     -349.   -1110. 11932.
 2             0 2022.04       -38.0 -266.     -219.    -555.  8637.
 3             0 2022.05       -38.0 -265.     -383.    -555.  9048.
 4             0 2022.06       -38.0 -257.     -466.    -555.  9048.
 5             0 2022.07       -38.0 -270.    -1070.    -555.  8637.
 6             0 2022.08       -38.0 -262.     -314.    -555.  9459.
 7             0 2022.09       -38.0 -256.     -295.    -555.  9048.
 8             0 2022.10       -38.0 -267.      -25.0   -555.  8637.
 9             0 2022.11       -38.0 -261.     -377.    -555.  9048.
10             0 2022.12       -38.0 -266.     -357.    -555.  9048.
# ... with 13,321 more rows, and 10 more variables:
#   RentAdjustment <dbl>, monthly_cost <dbl>, monthly_income <dbl>,
#   monthly_balance <dbl>, householdSize <dbl>, haveKids <lgl>,
#   age <dbl>, educationLevel <chr>, interestGroup <chr>,
#   joviality <dbl>

3. Visulisations and Insights

3.1 How does the financial health of the residents change over the period?

To visualise the financial change during this 15 months, we will use ridge plot to show the economic situation

hide
p1 <- ggplot(combine, 
             aes(x=Wage, 
                 y=combine$yearmonth, 
                 fill = factor(stat(quantile)))) +
  stat_density_ridges(geom = "density_ridges_gradient", 
                      calc_ecdf = TRUE,
                      quantiles = 4, 
                      quantile_lines = TRUE) +
  scale_fill_viridis_d(name = "Quartiles") +
  labs(x= "Wage",
       y= "Time",
       title="Distribution of Residents' Wage")
p1

According to above ridge plot, we can know that residents’ wage in 2022 March is higher than the following 14 months.

hide
p2 <- ggplot(combine,
            aes(x = combine$monthly_balance,
                y = combine$yearmonth))+
  geom_density_ridges(jittered_points = TRUE,
                      position = position_points_jitter(width = 0.05, 
                                                        height = 0),
                      point_shape = '|', 
                      point_size = 3, 
                      point_alpha = 1, 
                      alpha = 0.7,) +
  stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE) +
  geom_density_ridges_gradient(scale = 2, rel_min_height = 0.01) +
  scale_fill_viridis_c(name = "Monthly Balance", direction = -1) +
  theme(axis.title.y=element_text(angle=0),
      axis.line = element_line(color='grey'), 
      plot.title = element_text(hjust = 0.5),
      axis.title.y.left = element_text(vjust = 0.5,), 
      axis.text = element_text(face="bold")) +
  labs(x= "Monthly Balance",
       y= "Time",
       title="Distribution of Residents' Monthly Balance")

p2

Accordingly, the distribution of monthly balance is in a similar pattern. And the dot plot delow the ridge plot shows that people with higher balance is affected even more serious.

3.2 How do wages compare to the overall cost of living in Engagement?

hide
plot_ly(data = combine,
        x = ~Wage,
        y = ~monthly_cost,
        text = ~paste("Period:", yearmonth,
                      "<br>Balance:", monthly_balance),
        color = ~yearmonth) %>%
  layout(title = 'Monthly Wage versus Monthly Living Cost',
         ylabel = 'Monthly Cost of Living')

Residents

3.3 Are there groups that appear to exhibit similar patterns?

hide
edu_financial <- combine %>% 
  select(educationLevel, yearmonth, monthly_income, 
         monthly_cost, monthly_balance) %>% 
  group_by(educationLevel) %>%
  summarise(avgIncome = mean(monthly_income),
            medianIncome =median(monthly_income),
            avgCost = mean(monthly_cost),
            avgBalance = mean(monthly_balance))
edu_financial
# A tibble: 4 x 5
  educationLevel      avgIncome medianIncome avgCost avgBalance
  <chr>                   <dbl>        <dbl>   <dbl>      <dbl>
1 Bachelors               4989.        4296.  -1423.      3566.
2 Graduate                6154.        5394.  -1478.      4677.
3 HighSchoolOrCollege     3151.        2785.  -1374.      1777.
4 Low                     2661.        2373.  -1260.      1401.
hide
interest_financial <- combine %>% 
  select(interestGroup, yearmonth, monthly_income, 
         monthly_cost, monthly_balance) %>% 
  group_by(interestGroup,yearmonth) %>%
  summarise(avgIncome = mean(monthly_income),
            avgCost = mean(monthly_cost),
            avgBalance = mean(monthly_balance))
interest_financial
# A tibble: 150 x 5
# Groups:   interestGroup [10]
   interestGroup yearmonth avgIncome avgCost avgBalance
   <chr>         <chr>         <dbl>   <dbl>      <dbl>
 1 A             2022.03       6363.  -2091.      4272.
 2 A             2022.04       3961.  -1441.      2520.
 3 A             2022.05       4140.  -1377.      2762.
 4 A             2022.06       4132.  -1354.      2778.
 5 A             2022.07       3975.  -1386.      2588.
 6 A             2022.08       4308.  -1340.      2967.
 7 A             2022.09       4134.  -1354.      2780.
 8 A             2022.10       3969.  -1375.      2594.
 9 A             2022.11       4129.  -1314.      2815.
10 A             2022.12       4147.  -1363.      2784.
# ... with 140 more rows